\section{Introduction}
There are more than 7000 languages in the world~\cite{ethnologue}, which fall into more than 140 genetic families,
each having descended from a common ancestor. The aim of traditional historical linguistics is to trace the
evolutionary path, in the form of a tree, from the extant languages back to their extinct common ancestor.
Genealogical relationship is not the only
characteristic which relates languages; languages can also share common structural features such as \emph{word
order}, \emph{phoneme inventory size} and \emph{morphology}. For instance, Finnish and Telugu are geographically
remote and yet both have an agglutinative morphology.
It would be a grave error to posit that two
languages are genetically related due to a single common structural feature. There have been
attempts in the past~\cite{nichols1995diachronically} to rank the stability of structural features. Stability implies
the resistance of a structural feature to change across space and time. For instance, Dravidian languages have adhered
to subject-object-verb (SOV) word order for the last two thousand years~\cite{krishnamurti2003dravidian,dunn84ger}.
Hence, it can be claimed that the structural feature SOV is very stable in the Dravidian language family. Structural
features have recently been used for inferring the
evolutionary tree of a small group of Papuan languages of the Pacific~\cite{dunn2005structural}.

In the area of computational linguistics, genealogical distances between languages have been shown to be
useful for predicting the difficulty of machine translation~\cite{birch:2008}. However, the use of typological distances
in the development of various NLP tools largely remains unexplored. Typologically similar languages provide useful
leverage when working with low-resource languages.
In this paper, we compare typological distances both with the internal classification of language families and with
within-family lexical divergence.

The paper is structured as follows. In Section~\ref{sec:related}, we summarize the related work.
Section~\ref{sec:contributions} lists the contributions of this work. Section
\ref{sec:database} describes the typological database, lexical database and the criteria for preparing the final
dataset. Section~\ref{sec:measures} presents the different vector similarity measures and the evaluation procedure. The
results of our experiments are given in Section~\ref{sec:results}. We conclude the paper and discuss the future
directions in Section~\ref{sec:conclusions}.

\section{Related Work}\label{sec:related}
\citet{dunn2005structural} were the first to apply a well-tested computational phylogenetic method (from computational
biology), Maximum Parsimony (MP;~\citealt{felsenstein2004inferring}) to
typological features (phonological, syntactic and morphological). The authors used MP to classify a set of
languages in Oceania belonging to two different, unrelated families.
In another related work,~\citet{wichmann2007use} apply three different phylogenetic algorithms -- Neighbor
Joining~\cite{saitou1987neighbor}, MP and Bayesian inference~\cite{huelsenbeck2001mrbayes} -- to the typological
features (from WALS) of 63 Native American languages. They also rank the typological features in terms of
stability. \citet{nichols2008tutorial} survey the use of typological features for
language classification in computational historical linguistics. In a novel
work, \citet{bakker2009adding} combine typological distances with lexical similarity to boost language
classification accuracy: they first compute pair-wise typological distances for 355 languages by applying a
length-normalized Hamming distance to 85 typological features (ranked by~\citet{wichmann2009assessing}), and then
combine these typological distances with lexical divergences derived from lexicostatistical lists. Unfortunately,
these works seem to have gone unnoticed in computational linguistics.

A typological feature, phoneme inventory size (extracted from the WALS
database; \citealt{haspelmath2008wals}), was used by \citet{atkinson2011phonemic} to claim that phoneme inventory size
decreases as one moves away from Africa\footnote{Assuming a monogenesis hypothesis of language
similar to the monogenesis hypothesis of \emph{homo sapiens}.}. In another work, \citet{dunn2011evolved} attempt to
demonstrate that there are lineage-specific trends in the word order universals across the language families of the
world.

In computational linguistics, \citet{daume2009non} and \citet{georgi2010comparing} use typological features from WALS
to investigate the relation between phylogenetic groups and feature stability. \citet{georgi2010comparing} motivate
the use of typological features for projecting linguistic resources such as treebanks and for bootstrapping NLP tools
from ``resource-rich'' to ``low-resource'' languages which are genetically unrelated yet share similar syntactic
features due to contact (e.g., Swedish and Finnish). \citet{georgi2010comparing} compute pair-wise
distances from typological feature vectors using cosine similarity and a shared overlap measure (the ratio of the
number of shared features to the total number of features between a pair of feature vectors). They apply three
different clustering algorithms -- k-means, partitional, agglomerative -- to the WALS dataset, with the number of
clusters as a test parameter, and observe that the clustering performance (in terms of F-score) is not at its best
when the number of clusters agrees with the exact number of families ($121$) in the whole-world dataset. They find
that the simplest clustering algorithm, k-means, performs best across all three datasets. However, the authors do not
correct for geographical bias in the dataset.

\section{Contributions}\label{sec:contributions}
In this article, we depart from~\citet{georgi2010comparing} by not investigating the much-researched topics of feature
stability and the feature prediction accuracy of clustering measures. Rather, we try to answer the following questions:

\begin{compactitem}
\item Do we really need a clustering algorithm to measure the internal classification accuracy of a language family?
\emph{Internal classification} accuracy is here a measure of the closeness of the typological distances to the
internal structure of a language family.
\item How well do the typological distances within a family correlate with the lexical distances derived from
lexicostatistical lists~\cite{swadesh1952lexico,wichmannasjp14}, originally proposed for language classification?
 \item Given that there are more than a dozen vector similarity measures, which vector similarity measure is the best
for the above-mentioned tasks?
\end{compactitem}

\section{Database}\label{sec:database}
In this section, we describe WALS and the \emph{Automated Similarity Judgment Program}
(ASJP), the two databases used in our experiments.

\subsection*{WALS}
 The WALS database\footnote{Accessed on 2011-09-22.} has
144 feature types for 2676 languages distributed across the globe. As noted by~\citet{hh13}, the WALS database is
sparse across many language families of the world, and the dataset needs pruning before it can be used for further
investigation. The database is represented as a matrix of languages vs.\ features, and it has to be pruned in both
directions (languages as well as features) to avoid sparsity when computing the pair-wise distances between languages.
Following~\citet{georgi2010comparing}, we remove all languages with fewer than $25$ attested features, and we remove
all features attested in fewer than $10\%$ of the languages. This leaves the dataset with $1159$ languages and
$193$ features. Our dataset includes only those families having more than 10
languages (following~\citealt{wichmann2010evaluating}), shown in Table~\ref{tab:tab1}. \citet{georgi2010comparing}
work with a pruned dataset of 735 languages and two major families, Indo-European and Sino-Tibetan, whereas we
investigate the questions in Section~\ref{sec:contributions} for all the well-defined language families (such as
Austronesian and Afro-Asiatic) given in Table~\ref{tab:tab1}.
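
For concreteness, the pruning procedure can be sketched in a few lines of Python (a minimal sketch, assuming the WALS
data has been exported to a language-by-feature CSV file with empty cells for unattested values; the file and column
names are hypothetical):

\begin{verbatim}
# A minimal sketch of the pruning procedure. Assumes the WALS data is
# available as a language-by-feature CSV with empty cells for
# unattested values; file and column names are hypothetical.
import pandas as pd

wals = pd.read_csv("wals_matrix.csv", index_col="wals_code")

# Remove languages with fewer than 25 attested features.
wals = wals[wals.notna().sum(axis=1) >= 25]

# Remove features attested in fewer than 10% of the remaining languages.
wals = wals.loc[:, wals.notna().sum(axis=0) >= 0.10 * len(wals)]

print(wals.shape)  # our snapshot yields 1159 languages, 193 features
\end{verbatim}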

\begin{table}[htb]
\centering
\begin{tabular}{|lc|lc|}
\hline
Family & Count & Family & Count \\
\hline
Austronesian  &  150 (141) & Austro-Asiatic  &  22 (21) \\
Niger-Congo  &  143 (123) & Oto-Manguean  &  18 (14) \\
Sino-Tibetan  &  81 (68) & Arawakan  &  17 (17) \\
Australian  &  73 (65) & Uralic  &  15 (12) \\
Nilo-Saharan  &  69 (62) & Penutian  &  14 (11) \\
Afro-Asiatic  &  68 (57) & Nakh-Daghestanian  &  13 (13) \\
Indo-European  &  60 (56) & Tupian  &  13 (12) \\
Trans-New Guinea  &  43 (33) & Hokan  &  12 (12) \\
Uto-Aztecan  &  28 (26) & Dravidian  &  10 (9) \\
Altaic  &  27 (26) & Mayan  &  10 (7) \\
\hline
\end{tabular}
\caption{Number of languages in each family. The number in parentheses for each family gives the number of languages
remaining after mapping with the ASJP database.}
\label{tab:tab1}
\end{table}

\subsection*{ASJP}
An international consortium of scholars (calling themselves
ASJP; \citealt{brown2008automated}) started collecting Swadesh word lists~\cite{swadesh1952lexico} (short
concept-meaning lists, usually ranging from 40 to 200 items) for most of the world's languages (more than 58\%), in
the hope of automating the classification of the world's
languages.\footnote{Available at: \url{http://email.eva.mpg.de/~wichmann/listss14.zip}} The
ASJP lexical items are transcribed in a broad phonetic transcription called ASJP Code~\cite{brown2008automated},
which collapses distinctions in vowel length, stress and tone, and reduces all click sounds to a single click symbol.
The database has word lists for each language (identified by its unique ISO 639-3 code as well as its WALS code) and
its dialects. We use the WALS code to map the languages in the WALS database to those in the ASJP database. Whenever
a language with a WALS code has more than one word list in the ASJP database, we retain the first word list for our
experiments. An excerpt of the word list for Russian is shown in Figure~\ref{fig:fig1}. The first line consists of
the name of the language and the WALS classification (Indo-European family, Slavic genus), followed by the Ethnologue
classification (indicating that Russian belongs to the Eastern Slavic subgroup of the Indo-European family). The
second line consists of the latitude, longitude, number of speakers, WALS code and ISO 639-3 code. Lexical items
begin from the third line.
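
For illustration, one entry in this format can be read as follows (a rough sketch; the exact field layout of the ASJP
files -- whitespace-separated metadata on the second line and one tab-separated concept--form pair per lexical line --
is an assumption here, so this should be treated as illustrative rather than as a reference parser):

\begin{verbatim}
# Illustrative sketch only: the precise ASJP file layout is assumed
# here, not documented. `lines` holds the lines of one language entry.
def parse_entry(lines):
    # line 1: language name, WALS and Ethnologue classification
    header = lines[0].strip()
    # line 2: latitude, longitude, speakers, WALS code, ISO 639-3 code
    lat, lon, speakers, wals_code, iso = lines[1].split()[:5]
    # line 3 onwards: one concept and its transcribed form per line
    items = {}
    for line in lines[2:]:
        concept, form = line.split("\t")[:2]
        items[concept] = form
    return {"name": header, "wals_code": wals_code,
            "iso": iso, "items": items}
\end{verbatim}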

\begin{figure}[h!]
\centering
\includegraphics[height=0.2\textwidth, width=0.4\textwidth]{asjp_sample.png}
\caption{10 lexical items in Russian.}
\label{fig:fig1}
\end{figure}

\subsection*{Binarization}
Each feature in the WALS dataset is either binary (coding the presence or absence of the feature in a language) or
multi-valued, coded as discrete integers over a finite range. \citet{georgi2010comparing} binarize the
feature values by recording the presence or absence of each feature value in a language. This binarization greatly
expands the length of the feature vector for a language, but allows a wide-ranged feature such as \emph{word order}
(which has $7$ feature values) to be represented as a sequence of $1$'s and $0$'s. The issue of binary vs.\
multi-valued features has been a point of debate in genetic linguistics; the two codings have been shown not to give
very different results for the Indo-European classification~\cite{Atkinson:06}.
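
The binarization itself is a one-hot encoding and can be sketched as follows (assuming the pruned WALS matrix from
above, with one integer-coded column per feature and missing values for unattested features):

\begin{verbatim}
# A minimal sketch of the binarization step: every (feature, value)
# combination becomes its own 0/1 column, so e.g. the 7-valued word
# order feature expands into 7 bits per language.
import pandas as pd

def binarize(wals: pd.DataFrame) -> pd.DataFrame:
    # get_dummies one-hot encodes each column; unattested (missing)
    # values simply yield all-zero bits for that feature.
    return pd.get_dummies(wals.astype("object"), prefix_sep="=")
\end{verbatim}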

\section{Measures}\label{sec:measures}
In this section, we present the 15 vector similarity measures used in our experiments (shown in
Table~\ref{tbl:measures}), followed by the evaluation measure for comparing typological distances to the WALS
classification. We then present the procedure for computing lexical divergence from the ASJP word lists.

% \subsection*{Vector similarity measures}
\begin{table}[hbt]
\linespread{1.6}
% \centering
\begin{minipage}[t]{0.38\textwidth} \footnotesize
\begin{center}
\begin{tabular}{|l|c|}
\hline
\multicolumn{2}{|c|}{\bf Vector similarity} \\
\hline
euclidean & $\sqrt{\Sigma_{i=1}^{n} (v_1^i-v_2^i)^2}$ \\
seuclidean & $\Sigma_{i=1}^{n} (v_1^i-v_2^i)^2$ \\
nseuclidean & $\dfrac{\norm{\sigma_1-\sigma_2}}{2\left(\norm{\sigma_1}+\norm{\sigma_2}\right)}$ \\
manhattan & $\Sigma_{i=1}^{n} \abs{v_1^i-v_2^i}$ \\
chessboard & $\max\left(\abs{v_1^i-v_2^i}\; \forall i\in (1, n)\right)$\\
braycurtis & $\dfrac{\Sigma_{i=1}^{n} \abs{v_1^i-v_2^i}}{\Sigma_{i=1}^{n} \abs{v_1^i+v_2^i}}$ \\
%canberra &
cosine & $\dfrac{v_1\cdot v_2}{\norm{v_1}*\norm{v_2}}$ \\
correlation & $1-\dfrac{\sigma_1\cdot \sigma_2}{\norm{\sigma_1}*\norm{\sigma_2}}$ \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\begin{minipage}[t]{0.62\textwidth} \footnotesize
\begin{center}
\begin{tabular}{|l|c|}
\hline
\multicolumn{2}{|c|}{\bf Boolean similarity} \\
\hline
hamming & $\#_{\ne 0}(v_1 \XOR v_2)$ \\
jaccard & $\dfrac{\#_{\ne 0}(v_1 \XOR v_2)}{\#_{\ne 0}(v_1 \XOR v_2)+\#_{\ne 0}(v_1 \& v_2)}$ \\
tanimoto & $\dfrac{2*\#_{\ne 0}(v_1 \XOR v_2)}{\#_{\ne 0}(v_1 \& v_2)+\#_{= 0}(v_1 \| v_2)+2*\#_{\ne 0}(v_1 \XOR v_2)}$
\\
matching & $\dfrac{\#_{\ne 0}(v_1 \XOR v_2)}{\#v_1}$ \\
dice & $\dfrac{\#_{\ne 0} (v_1 \XOR v_2)}{\#_{\ne 0} (v_1 \XOR v_2) + 2*\#_{\ne 0} (v_1 \& v_2)}$ \\
sokalsneath & $\dfrac{2*\#_{\ne 0} (v_1 \XOR v_2)}{2*\#_{\ne 0} (v_1 \XOR v_2) + \#_{\ne 0} (v_1 \& v_2)}$ \\
russellrao & $\dfrac{\#_{\ne 0} (v_1 \XOR v_2)+\#_{= 0} (v_1 \| v_2)}{\# v_1}$ \\
yule & $\dfrac{2*\#_{\ne 0} (v_1 - v_2)*\#_{\ne 0} (v_2 - v_1)}{\#_{\ne 0} (v_1 - v_2)*\#_{\ne 0} (v_2 - v_1) + \#_{\ne
0}(v_1 \& v_2)*\#_{= 0} (v_1 \| v_2)}$ \\
\hline
\end{tabular}
\end{center}
\end{minipage}
\linespread{1}
\caption{Different vector similarity measures used in our experiments (distances computed between vectors $v_1$ and
$v_2$). In the vector similarity measures, $\norm{\cdot}$ represents the L$_2$ norm of a vector, and $\sigma_1$
represents the difference of a vector from its mean ($\mu_1$), i.e. $\sigma_1 = v_1 - \mu_1$. Similarly, for the
boolean similarity measures, $\XOR$ stands for the logical XOR operation between bit vectors, while $\&$ and $\|$
stand for the logical AND and OR operations respectively. $\#_{\ne 0}\left ( \cdot \right )$ stands for the number of
non-zero bits in a boolean vector, $\#_{= 0}\left ( \cdot \right )$ for the number of zero bits, $\#v_1$ for the
length of the vector, and $v_1 - v_2$ (in the `yule' measure) for the set difference, i.e. the bits set in $v_1$ but
not in $v_2$.}
\label{tbl:measures}
\end{table}
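
Most of the measures in Table~\ref{tbl:measures} are available off the shelf. The sketch below computes the pair-wise
distance matrices with SciPy; note that some of our names differ from SciPy's (`manhattan' is \texttt{cityblock},
`chessboard' is \texttt{chebyshev}, `tanimoto' is \texttt{rogerstanimoto}, `seuclidean' here denotes the squared
Euclidean distance \texttt{sqeuclidean}, and `nseuclidean' has no direct SciPy counterpart and is omitted). SciPy's
\texttt{hamming} normalizes the XOR count by the vector length, so it equals our `matching' measure and differs from
our `hamming' only by a constant factor, which does not affect the correlations we compute.

\begin{verbatim}
# A sketch of the distance computation, assuming X is the binarized
# language-by-feature matrix (rows are languages, columns are bits).
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_matrices(X):
    X = np.asarray(X, dtype=bool)
    metrics = ["euclidean", "sqeuclidean", "cityblock", "chebyshev",
               "braycurtis", "cosine", "correlation", "hamming",
               "jaccard", "rogerstanimoto", "dice", "sokalsneath",
               "russellrao", "yule"]
    # squareform turns the condensed pdist output into a full
    # languages-by-languages matrix.
    return {m: squareform(pdist(X, metric=m)) for m in metrics}
\end{verbatim}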

% Sample points of all the distance measures for the top five language families.

\subsection*{Internal classification accuracy}
Apart from typological information for the world's languages, WALS also provides a two-level classification of each
language family. In the WALS classification, the top level is the family name, the next level is the genus, and a
language rests at the bottom. For instance, the Indo-European family has $10$ genera. A genus is a consensually
defined unit and not a rigorously established genealogical unit~\cite{hh13}. Rather, a genus corresponds to a group
of languages supposed to have descended from a proto-language which is about $3500$ to $4000$ years old. For
instance, WALS lists the Indic and Iranian languages as separate genera, whereas both genera actually descend from
Proto-Indo-Iranian, which in turn descended from Proto-Indo-European -- a fact well known in historical
linguistics~\cite{campbell2008language}.

The WALS classification for each language family listed in Table~\ref{tab:tab1} can be represented as a
2D matrix with languages along both rows and columns. Each cell of such a matrix represents the WALS relationship
between a language pair in the family: a cell has 0 if the language pair belongs to the same genus and 1 if the two
languages belong to different genera. The pair-wise distance matrix obtained from each vector similarity measure is
compared to this 2D matrix using a special case of Pearson's $r$ known as the point-biserial
correlation\footnote{\url{http://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient}}.
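
A small sketch of this evaluation, assuming a languages-by-languages typological distance matrix \texttt{dist} and a
list \texttt{genus} giving each language's WALS genus, is the following:

\begin{verbatim}
# Point-biserial correlation between a typological distance matrix
# and the binary WALS matrix (0 = same genus, 1 = different genus).
import numpy as np
from scipy.stats import pointbiserialr

def internal_classification_r(dist, genus):
    genus = np.asarray(genus)
    same = genus[:, None] == genus[None, :]
    iu = np.triu_indices(len(genus), k=1)  # each unordered pair once
    binary = (~same)[iu].astype(int)
    r, p = pointbiserialr(binary, dist[iu])
    return r
\end{verbatim}
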
\subsection*{Lexical distance}
The ASJP program computes the distance between two languages as the average pair-wise length-normalized Levenshtein
distance~\cite{levenshtein1965binary}, called Levenshtein Distance Normalized (LDN). LDN is further modified to
account for chance resemblances, such as accidental similarity between the phoneme inventories of a pair of
languages, yielding LDND (Levenshtein Distance Normalized
Divided;~\citealt{holman2008proceedings}). The performance of LDND distance
matrices has been evaluated against expert classifications of the world's
languages in at least two recent works~\cite{pompei2011accuracy,wichmann2012correlates}, whose findings confirm that
LDND matrices largely agree with the classifications given by historical linguists. This result puts us on strong
ground in using ASJP's LDND as a measure of lexical divergence within a family.
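
For completeness, a compact sketch of LDN and LDND under our reading of~\citet{holman2008proceedings} is given below;
\texttt{a} and \texttt{b} map concepts to ASJP-transcribed forms, and the chance correction divides the same-concept
average by the different-concept average:

\begin{verbatim}
# Sketch of LDN/LDND; a and b map concepts to ASJP-transcribed words.
from itertools import product

def lev(s, t):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def nld(s, t):
    # length-normalized Levenshtein distance
    return lev(s, t) / max(len(s), len(t))

def ldnd(a, b):
    shared = sorted(set(a) & set(b))
    on = [nld(a[c], b[c]) for c in shared]            # same concept
    off = [nld(a[c1], b[c2]) for c1, c2
           in product(shared, repeat=2) if c1 != c2]  # different concepts
    ldn = sum(on) / len(on)
    return ldn / (sum(off) / len(off))  # divide out chance similarity
\end{verbatim}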

The distribution of the languages included in this study is plotted in Figure~\ref{fig:fig2}.
\begin{figure}[h!]
\centering
\includegraphics[height=0.25\textwidth, width=0.5\textwidth]{wals-langs.png}
\caption{Geographical distribution of the languages in the final dataset.}
\label{fig:fig2}
\end{figure}

The correlation between typological distances and lexical distances within a family is computed as Spearman's rank
correlation $\rho$ between the typological and lexical distances of all language pairs in the family. It is worth
noting that~\citet{bakker2009adding} also compare LDND distance matrices with WALS distance matrices, for $355$
languages from various families, using Pearson's $r$, whereas we compare within-family LDND matrices with WALS
distance matrices derived from $15$ similarity measures.
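
Concretely, the within-family comparison amounts to the following sketch (assuming both matrices index the family's
languages in the same order):

\begin{verbatim}
# Spearman's rho between typological and LDND distances over all
# language pairs within one family.
import numpy as np
from scipy.stats import spearmanr

def family_correlation(typ_dist, ldnd_dist):
    iu = np.triu_indices(typ_dist.shape[0], k=1)
    rho, p = spearmanr(typ_dist[iu], ldnd_dist[iu])
    return rho
\end{verbatim}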

\section{Results}\label{sec:results}
In this section, we present and discuss the results of our experiments in internal classification and correlation with
lexical divergence. We use heat maps to visualize the correlation matrices resulting from both experiments.
\subsection*{Internal classification}
The point-biserial correlation $r$, introduced in Section~\ref{sec:measures}, lies in the range $-1$ to $+1$. The
value of $r$ is undefined for the Arawakan and Mayan families since both families have a single genus in their
respective WALS classifications; consequently, $r$ is shown in white for both of these families. The chessboard
measure is blank across all language families since it gives a score of $1$ for any pair of distinct binary vectors.
Interestingly, all vector similarity measures except `russellrao' perform well for the Australian, Austro-Asiatic,
Indo-European and Sino-Tibetan language families. We find this result quite encouraging, since these families
comprise more than 33\% of the total languages in the sample given in Table~\ref{tab:tab1}. Among the measures,
`matching', `seuclidean', `tanimoto', `euclidean', `hamming' and `manhattan' perform best across these four families.
Interestingly, the widely used `cosine' measure does not perform as well as `hamming'. None of the vector similarity
measures seems to perform well for the Austronesian and Niger-Congo families, which comprise more than $14\%$ and
$11\%$ of the world's languages respectively. The worst performing language family is Tupian. This does not come as a
surprise, since Tupian has 5 genera with one language each and a single genus comprising the rest of the family. The
Australian and Austro-Asiatic families show the maximum correlation across `seuclidean', `tanimoto', `euclidean',
`hamming' and `manhattan'.

\begin{figure}[htb]
\centering
\includegraphics[width=\textwidth, height=0.5\textwidth]{WALS-point-biserial-correlation.png}
\caption{Heatmap showing the gradience of $r$ across different language families and vector similarity measures.}
\label{fig:fig3} %
% \end{center}
\end{figure} %
\subsection*{Lexical divergence}
The rank correlation between LDND and the vector similarity measures is high across the Australian, Sino-Tibetan,
Uralic, Indo-European and Niger-Congo families. The `russellrao' measure works best for the families -- Arawakan,
Austro-Asiatic, Tupian and Afro-Asiatic -- which otherwise have poor correlation scores for the rest of the measures.
The maximum correlation is achieved by the `yule' measure for the Uralic family. Indo-European, the well-studied
family, shows correlations ranging from $0.08$ to the maximum possible value across all measures, except for the
`russellrao' and `braycurtis' distances. It is not clear why the Hokan family shows the lowest correlation of all the
families.
\begin{figure}[htb]
\centering
% \epsfig{WALS-ASJP-Rank-correlation_2, height = 0.5\textwidth, width=0.8\textwidth}
\includegraphics[width=\textwidth, height=0.5\textwidth]{WALS-ASJP-Rank-correlation.png}
\caption{Heatmap showing the gradience of $\rho$ across different families and vector similarity measures.}
\label{fig:fig4} %
% \end{center}
\end{figure} %

\section{Conclusion}\label{sec:conclusions}
In summary, choosing the right vector similarity measure when calculating typological distances makes a difference in
the internal classification accuracy. The choice of similarity measure does not influence the correlation between WALS
distances and LDND distances within a family. The internal classification accuracies are similar to the accuracies
reported in~\citet{bakker2009adding}. Our correlation matrix suggests that internal classification accuracies of LDND
matrices (reported in~\citealt{bakker2009adding}) can be boosted through the right combination of typological distances
and lexical distances. In our experiments, we did not control for feature stability but experimented with all
available features. By choosing a smaller set of typological features (from the ranking
of~\citet{wichmann2009assessing}) and the right similarity measure, one might achieve higher accuracies. The current
rate of language extinction is unprecedented in human history. Our findings might help speed up the classification of
many small dying families by serving as a springboard for traditional historical linguists.


